Compound Word Recombination for German LVCSR

نویسندگان

  • Markus Nußbaum-Thom
  • Amr El-Desoky Mousa
  • Ralf Schlüter
  • Hermann Ney
چکیده

Compound words are a difficulty for German speech recognition systems since they cause high out-of-vocabulary and word error rates. State of the art approaches augment the language model by the fragments of compounds in order to increase lexical coverage, lower the perplexity and out-of-vocabulary rate. The fragments are tagged in order to concatenate subsequent equally tagged fragments in the recognition result, but this does not guarantee the recombination of proper words. Such recombination techniques neglect the large vocabulary of the language model training data for recombination although most compounds are covered by it. In this paper, we investigate the use of this vocabulary for the recombination of compound words from the recognition result. The approach is tested on two large vocabulary tasks on top of full-word and fragment based language models and achieves good improvements of 3– 7% relative over the baseline compound-sensitive word error rate.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches

This paper proposes a novel combined compound splitting and phrase recombination method that optimizes the composition of the speech recognition lexicon for a given domain. Data-driven compound word splitting is followed by iterative recombination of high frequency combinations. Language model perplexity and size are the criteria used to identify a balance between compound decomposition, which ...

متن کامل

Feature-rich sub-lexical language models using a maximum entropy approach for German LVCSR

German is a morphologically rich language having a high degree of word inflections, derivations and compounding. This leads to high out-of-vocabulary (OOV) rates and poor language model (LM) probabilities in the large vocabulary continuous speech recognition (LVCSR) systems. One of the main challenges in the German LVCSR is the recognition of the OOV words. For this purpose, data-driven morphem...

متن کامل

A Broadcast News Corpus for Evaluation and Tuning of German LVCSR Systems

Transcription of broadcast news is an interesting and challenging application for large-vocabulary continuous speech recognition (LVCSR). We present in detail the structure of a manually segmented and annotated corpus including over 160 hours of German broadcast news, and propose it as an evaluation framework of LVCSR systems. We show our own experimental results on the corpus, achieved with a ...

متن کامل

Investigation of Maximum Entropy Hybrid Language Models for Open Vocabulary German and Polish LVCSR

For languages like German and Polish, higher numbers of word inflections lead to high out-of-vocabulary (OOV) rates and high language model (LM) perplexities. Thus, one of the main challenges in large vocabulary continuous speech recognition (LVCSR) is recognizing an open vocabulary. In this paper, we investigate the use of mixed type of sub-word units in the same recognition lexicon. Namely, m...

متن کامل

A hybrid approach to compounds in LVCSR

In several languages compound words form orthographic units, which complicates the task of ensuring good lexical coverage for large vocabulary continuous speech recognition (LVCSR). A common approach to the problem consists of first recognizing the compound constituents, followed by an automatic recompounding process. We describe an accurate compound module, which combines a rule-based approach...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011